(54) A reconfigurable signal processor with embedded flash memory device



(57) The present invention relates to a dynamically
reconfigurable processing unit (1) including an embedded
Flash memory device (4) for non-volatile storage of
code, data and bit-streams, the unit (1) being integrated
into a single chip together with a microprocessor (2)
core. Advantageously, the processing unit further comprises
an S-RAM based embedded FPGA unit structured
for FPGA reconfigurations, having a specific programming
interface (7) connected to a port (FP) of said
Flash memory device (4) through a DMA channel (8).
Description

Field of invention 

[0001] The present invention relates to a dynamically 
reconfigurable processing unit tightly connected to a 
Flash EEPROM memory subsystem. 
[0002] More specifically, the invention relates to a
reconfigurable signal processing IC with an embedded
Flash memory device for non-volatile storage of code,
data and bit-streams, the unit being integrated into a
single chip together with a microprocessor core.

Prior art 

[0003] As is well known by those skilled in this technical
field, increasing complexity of system design and short- 
er time-to-market requirements are leading research to- 
wards the investigation of hybrid systems including 
processors enhanced by programmable logic. 
[0004] In this respect, reference is made to the work 
by Young-Don Bae et al., "A Single-Chip Programmable
Platform Based on a Multithreaded Processor and Configurable
Logic Clusters", ISSCC 2002 Digest of Technical
Papers, pp. 336-337, Feb. 2002.
[0005] Further reference is made to the article by Zhang
et al. entitled "A 1 V Heterogeneous Reconfigurable
Processor IC for Baseband Wireless Applications",
ISSCC 2000 Digest of Technical Papers, pp. 68-69, 488,
Feb. 2000.
[0006] At the same time, rising mask-set costs and the
shorter time-to-market available for new products are
leading to the introduction of systems with a higher
degree of programmability and configurability, such as
systems-on-chip with configurable processors, embedded
FPGA and embedded flash memory.
[0007] Moreover, the availability of an advanced embedded
flash technology, based on NOR architecture,
together with innovative IPs, such as embedded flash
macrocells with special features, is a key factor.
[0008] For a better understanding of the present in- 
vention reference is also made to the Field Programma- 
ble Gate Array (FPGA) technology combining standard 
processors with embedded FPGA devices. 
[0009] These solutions allow exactly the required
peripherals to be configured into the FPGA at deployment
time, and exploit temporal re-use by dynamically
reconfiguring the instruction set at run time based on
the currently executed algorithm.

[0010] The existing models for designing FPGA/processor
interaction can be grouped in two main categories:

- the FPGA is a co-processor communicating with the
main processor through a system bus or a specific
I/O channel;

- the FPGA is described as a function unit of the
processor pipeline.

[0011] The first group includes the GARP processor,
known from the article by T. Callahan, J. Hauser, and J.
Wawrzynek entitled "The Garp architecture and C
compiler", IEEE Computer, 33(4): 62-69, April 2000. A
similar architecture is provided by the A-EPIC processor
that is disclosed in the article by S. Palem and S. Talla
entitled "Adaptive explicit parallel instruction
computing", Proceedings of the fourth Australasian Computer
Architecture Conference (ACOAC), January 2001.
[0012] In both cases the FPGA is addressed via dedicated
instructions, moving data explicitly to and from
the processor. Control hardware is kept to a minimum,
since no interlocks are needed to avoid hazards, but a
significant overhead in clock cycles is required to
implement communication.

[0013] Only when the number of cycles per execution
of the FPGA is relatively high may the communication
overhead be considered negligible.
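
The amortization argument above can be illustrated with a minimal
sketch. The function name and all cycle counts are hypothetical and
chosen only for illustration; they are not figures from the cited
architectures.

```python
def communication_overhead_fraction(exec_cycles: int, comm_cycles: int) -> float:
    """Return the share of total cycles spent moving data to and from the FPGA.

    exec_cycles: cycles the FPGA spends computing per invocation (hypothetical).
    comm_cycles: cycles spent on explicit data transfers per invocation (hypothetical).
    """
    if exec_cycles < 0 or comm_cycles < 0:
        raise ValueError("cycle counts must be non-negative")
    total = exec_cycles + comm_cycles
    if total == 0:
        raise ValueError("at least one cycle is required")
    return comm_cycles / total


# Same transfer cost, very different execution lengths (illustrative numbers).
short_call = communication_overhead_fraction(exec_cycles=10, comm_cycles=20)
long_call = communication_overhead_fraction(exec_cycles=2000, comm_cycles=20)
```

With these illustrative numbers the transfer cost dominates the short
invocation and falls below one percent of the long one.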

[0014] In the commercial world, FPGA suppliers such
as Altera Corporation offer digital architectures based
on US Patent No. 5,968,161 to T.J. Southgate, "FPGA
based configurable CPU additionally including second
programmable section for implementation of custom
hardware support".

[0015] Other suppliers (Xilinx, Triscend) offer chips
containing a processor embedded on the same silicon
IC with embedded FPGA logic. See for instance US
Patent 6,467,009 to S.P. Winegarden et al., "Configurable
Processor System Unit", assigned to Triscend
Corporation.

[0016] However, the processor and the FPGA in those
chips are generally loosely coupled by a high-speed
dedicated bus, performing as two separate execution
units rather than being merged in a single architectural
entity. In this manner the FPGA does not have direct
access to the processor memory subsystem, which is one
of the strengths of the academic approaches outlined
above.

[0017] In the second category (FPGA as a function
unit) we find architectures commercially known as
"PRISC", "Chimaera" and "ConCISe".
[0018] In all these models, data are read and written
directly on the processor register file, minimizing the
overhead due to communication. In most cases, to minimize
control logic and hazard handling and to fit in the
processor pipeline stages, the FPGA is limited to
combinatorial logic only, thus severely limiting the
performance boost that can be achieved.

[0019] These solutions represent a significant step toward
a low-overhead interface between the two entities.
Nevertheless, due to the granularity of FPGA operations
and its hardware-oriented structure, their approach is
still very coarse-grained, reducing the possible
parallelism in resource usage and again raising hardware
issues that are neither familiar nor friendly to software
compilation tools and algorithm developers.

[0020] Thus, a relevant drawback of this approach is
often the memory data access bottleneck, which forces
long stalls on the FPGA device in order to fetch enough
data into the shared registers to justify its activation.
[0021] The technical problem of the present invention
is that of providing a new kind of reconfigurable
processing unit tightly connected to a memory architecture,
having functional and structural features capable of
offering significant performance and energy consumption
enhancements with respect to a traditional signal
processing device.

Summary of invention 

[0022] The invention overcomes the limitations of
similar preceding architectures by relying on an embedded
device of a different nature and on a new approach to
the processor/memory interface.

[0023] According to a first embodiment of the present 
invention, the reconfigurable processing unit targets im- 
age-voice processing and recognition application do- 
mains by joining a configurable and extensible proces- 
sor core and an SRAM-based embedded FPGA. 
[0024] More specifically, the processing unit according
to the invention further includes an S-RAM based
embedded FPGA unit structured for FPGA reconfigurations,
having a specific programming interface connected
to a port FP of said Flash memory device through a
DMA channel.

[0025] The features and advantages of the process- 
ing unit according to this invention will become apparent 
from the following description of a best mode for carrying 
out the invention given by way of non-limiting example 
with reference to the enclosed drawings. 

Brief description of the drawings 

[0026] 

Figure 1 is a block diagram of a processing unit ar- 
chitecture for data processing according to the 
present invention; 

Figure 2 is a block diagram of a Flash memory ar- 
chitecture embedded into the processing unit of Fig- 
ure 1; 

Figure 3 is a schematic view of system memory hi- 
erarchy provided by the present invention; 

Figure 4 is a block diagram of a specific processor
extension, for instance an example of added DSP
instructions;

Figure 5 is a block diagram of a further specific
processor extension, for instance an optimized fixed-point
calculation of the square root;

Figure 6 is a table view showing the overall performance
improvements for a face recognition task implemented
by the processing unit of the present invention;

Figure 7 is a schematic chip micrograph.

Detailed description

[0027] With reference to the drawings, generally
shown at 1 is a processing unit realized according
to the present invention for digital signal processing
based on reconfigurable computing.
[0028] The processing unit 1 includes an embedded
Flash memory device 4 for non-volatile storage of code,
data and bit-streams and a further S-RAM based embedded
FPGA unit 3 realized for the configuration purposes
of the present invention.
[0029] More specifically, an 8Mb application-specific
embedded flash memory device 4 is disclosed. The
memory device 4 is integrated into a single chip together
with a microprocessor 2 and the FPGA structure 3.
[0030] Advantageously, application-specific hardware
units are added and dynamically modified by reconfiguring
the embedded FPGA 3. By implementing
application-specific vector processing instructions the
processing unit 1 shows a peak computing power of
1 GOPS.

[0031] Efficient read-write-erase access to code, data
and FPGA bitstreams is provided by the Flash memory
device 4 based on a modular 8Mb, 4-bank Flash memory,
as will be more clearly explained hereinafter.
[0032] The processing unit 1 comprises three content-specific
I/O ports and delivers an aggregate peak
read throughput of 1.2GB/s.

[0033] The system architecture 1 is illustrated in Figure 1.

[0034] The functional purposes of the embedded FPGA 3 are:

i) extension of the processor datapath supporting a
set of additional special-purpose C-callable
microprocessor instructions;

ii) bus-mapped coprocessors, connected to the system
bus through a master/slave interface;

iii) flexible I/O to connect external units or sensors
with application-specific communication protocols.

[0035] Even though such different circuit purposes
would require different kinds of programmable logic for
best implementation of either arithmetic-dominated or
control-dominated logic, a single programmable logic
subsystem 3 has been implemented to be shared
among different purposes both in space (same configuration)
and time (subsequent configurations).
[0036] The single, high I/O count, fine-grain e-FPGA
3 operates as a datapath for the microprocessor pipeline
and as dedicated control logic for the bus coprocessor and
I/O control interface. The FPGA has a specific programming
interface 7 connected to a port FP of said Flash
memory device 4 through a DMA channel 8.
[0037] FPGA reconfiguration is concurrent to soft- 
ware execution. 

[0038] A local bus 6 connects a dedicated 32-bit Flash 
memory port FP to the FPGA programming interface 7. 
[0039] A DMA channel 8 handles the bitstream transfer
while the microprocessor 2 fetches instructions and data
from different Flash memory ports: a 64-bit wide code port
(CP) and a 64-bit wide data port (DP).

[0040] To support streaming applications a 1 kB dual- 
port buffer 9 is used to interface fast decoding hardware 
and slower software running on the processor 2. 
[0041] The memory sub-system architecture is shown
in Figure 2.

[0042] The modular structure of the memory (dotted 
line) includes: 

charge pumps 10 (Power Block); 

testability circuits 11 (DFT); 

a power management arbiter 12 (PMA); and, 

a customizable array 13 of N independent 2Mb flash
memory modules 16.

[0043] The number N may be chosen depending on the
storage requirements; N=4 in the current implementation.

[0044] The modular memory features (N+2) 128-bit
target ports and implements an N-bank uniform memory
13.

[0045] As previously mentioned, three content-specific
ports are dedicated to code (CP, 64-bit wide), data
(DP, 64-bit) and FPGA bitstream configurations (FP,
32-bit). A 128-bit sub-system crossbar 15 connects all
the architecture blocks and the eight-bit microprocessor
2.

[0046] The main features of the flash memory device 4
are: the sharing of the charge pumps 10 among different
flash memory modules 16 through the PMA arbiter 12
in a multi-bank fashion; the use of a small eight-bit
microprocessor 2 to ease memory system testing and to
add complex functionalities for data management; and
the use of an ADC (Analog-to-Digital Converter),
required by the application, to increase the system
self-test capability.

[0047] The third port FP of the Flash device 4 is dedicated
to managing embedded-FPGA (e-FPGA) configuration
data stored in the flash memory modules. The FP
port is read-only and provides fast sequential access for
downloading bit streams.

[0048] The FP port has four configuration registers
replicating the information stored in the CP port that must be
used in order to write e-FPGA configuration data.



[0049] The output data word bus and the address bus
are 32 bits wide. The FP port uses a chip select to access
the addressable memory space, and a burst enable
to allow burst serial access.

[0050] In a read operation, an output ready signal is tied
low when data are not immediately available, so that it
can act as a wait state signal.
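
The wait-state behaviour of the FP port during a burst read can be
sketched as a small cycle-level model. This is only an illustration
under stated assumptions: the function name fp_burst_read, the
per-cycle ready pattern and the example words are hypothetical, and the
actual signal timing of the port is not specified in this description.

```python
from itertools import cycle
from typing import List, Sequence, Tuple


def fp_burst_read(words: Sequence[int],
                  ready_pattern: Sequence[bool]) -> Tuple[List[int], int]:
    """Model a sequential burst read of 32-bit words with wait states.

    words: bitstream words stored at consecutive addresses (hypothetical data).
    ready_pattern: value of the output ready signal on successive clock
        cycles, repeated cyclically (assumed pattern). A False entry means
        the signal is low and the reader must wait for that cycle.
    Returns the words received and the number of clock cycles consumed.
    """
    if not any(ready_pattern):
        raise ValueError("output ready is never asserted")
    received: List[int] = []
    cycles = 0
    ready = cycle(ready_pattern)
    index = 0
    while index < len(words):
        cycles += 1
        if next(ready):
            # Data valid: latch one 32-bit word and advance the burst.
            received.append(words[index] & 0xFFFFFFFF)
            index += 1
        # Otherwise ready is low: this cycle is a wait state.
    return received, cycles


bitstream = [0xDEADBEEF, 0x00000001, 0x00000002, 0x00000003]
data, elapsed = fp_burst_read(bitstream, ready_pattern=[False, True, True])
```

In this example every third cycle is a wait state, so the four words
take six cycles instead of four.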
[0051] The eight-bit microprocessor 2 (uP) performs
additional complex functions (defragmentation, compression,
virtual erase, etc.) not natively supported by
the DP port, and assists in the built-in self test of the
memory system.

[0052] The (N+2)x4 128-bit crossbar 15 connects the
modular memory with the four initiators (CP, DP, FP and
uP), ensuring that at least three flash memory modules
16 can be read in parallel at full speed.
[0053] The memory space of the four modules 16 is
arranged in three programmable user-defined partitions,
each one devoted to a port. The memory system
clock can run up to 100MHz, and reading three modules
16 with a 128-bit data bus and 40ns access time results
in a peak read throughput of 1.2GB/s.
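
The peak throughput figure follows directly from the bus width and the
access time. The short calculation below reproduces it; the function
name is hypothetical and decimal units (1 MB = 10^6 bytes) are assumed.

```python
def module_read_throughput(bus_bits: int, access_time_ns: float) -> float:
    """Return the peak read throughput of one module in bytes per second.

    One bus-wide word (bus_bits / 8 bytes) is delivered per access time.
    """
    if bus_bits <= 0 or access_time_ns <= 0:
        raise ValueError("bus width and access time must be positive")
    return bus_bits / 8 * 1e9 / access_time_ns


# 128-bit data bus, 40 ns access time: 16 bytes every 40 ns per module.
per_module = module_read_throughput(bus_bits=128, access_time_ns=40)

# Three modules read in parallel, one per content-specific port.
aggregate = 3 * per_module
```

Here per_module evaluates to 400 MB/s and aggregate to 1.2 GB/s, in
agreement with the figures quoted in the text.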
[0054] Each 2Mb flash memory module 16 has a 
128-bit IO data bus with 40ns access time, resulting in 
400Mbyte/s, and a program/erase control unit. Simulta- 
neous memory operations use the power management 
arbiter 12 (PMA) for optimal scheduling. 
[0055] Available power and user-defined priorities are 
considered to schedule conflicting resource requests in 
a single clock cycle. 

[0056] The memory device 4 allows up to four simul- 
taneous operations, with a limit of one both for write and 
erase. 
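
A possible arbitration policy consistent with the constraints above can
be sketched as follows. This is not the disclosed PMA implementation:
the Request type, the power cost units, the convention that a lower
priority value wins, the one-operation-per-module rule and the reading
of the limit as at most one write and at most one erase at a time are
all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Request:
    module: int    # target flash module index
    op: str        # "read", "write" or "erase"
    priority: int  # user-defined priority; lower value wins (assumption)
    power: int     # power cost in arbitrary units (hypothetical)


def arbitrate(requests: Iterable[Request], power_budget: int,
              max_ops: int = 4) -> List[Request]:
    """Grant requests in priority order within the power budget.

    Assumed rules: at most max_ops simultaneous operations, at most one
    write and one erase, and at most one operation per module.
    """
    granted: List[Request] = []
    busy_modules = set()
    limited = {"write": 0, "erase": 0}
    remaining = power_budget
    for req in sorted(requests, key=lambda r: r.priority):
        if req.op not in ("read", "write", "erase"):
            raise ValueError(f"unknown operation: {req.op}")
        if len(granted) == max_ops:
            break
        if req.module in busy_modules:
            continue  # module already in use this cycle
        if req.op in limited and limited[req.op] >= 1:
            continue  # only one write and one erase at a time
        if req.power > remaining:
            continue  # not enough charge-pump power left
        granted.append(req)
        busy_modules.add(req.module)
        if req.op in limited:
            limited[req.op] += 1
        remaining -= req.power
    return granted


pending = [
    Request(module=0, op="read", priority=0, power=1),
    Request(module=1, op="write", priority=1, power=4),
    Request(module=2, op="write", priority=2, power=4),
    Request(module=3, op="erase", priority=3, power=5),
    Request(module=2, op="read", priority=4, power=1),
]
granted_now = arbitrate(pending, power_budget=10)
```

With a budget of 10 units the second write is rejected by the
one-write limit and the last read by the exhausted budget, so three
operations are granted.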

[0057] Figure 3 depicts the memory hierarchy and
parallelism across the processing unit 1. The ports CP
and DP are interfaced to the 64-bit, 800MB/s AHB system
bus 6.

[0058] At a system clock rate of 100MHz each I/O port
can independently operate at maximum speed. Thus, an
aggregate peak read rate of 1.2GB/s can be sustained,
as it is limited by the memory access time.
[0059] In the current implementation the e-FPGA
reconfiguration takes 500ns at 100 MHz. An average
throughput of 50MB/s out of the available 400MB/s is
currently sustained by the e-FPGA configuration interface
7.

[0060] System performance is being evaluated for an 
image processing application (facial recognition) and a 
speech recognition application. 
[0061] More than 20 specific instructions were de- 
signed as C/assembly-callable functions, automatically 
translated to RTL, then synthesized and mapped to the 
e-FPGA. 

[0062] Figures 4 and 5 show two examples of specific 
microprocessor extensions. 

[0063] Figure 4 relates to an eight-issue, eight-bit L2
calculation, which accounts for 23 eight-bit arithmetic
operations and six 64-bit operations, requiring about 10k
ASIC-equivalent gates.

[0064] Figure 5 relates to a datapath for an optimized
fixed-point calculation of the square root, which accounts
for twelve 32-bit operations, requiring about 2k
ASIC-equivalent gates.
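
A software reference for the two extensions may help to fix ideas. The
description does not give the algorithms mapped onto the e-FPGA, so the
sketch below rests on assumptions: the L2 calculation is taken to be a
sum of squared differences over eight 8-bit elements, and the square
root uses the well-known shift-and-subtract integer method. The function
names are hypothetical.

```python
from typing import Sequence


def l2_squared_8x8(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of squared differences over eight unsigned 8-bit elements."""
    if len(a) != 8 or len(b) != 8:
        raise ValueError("both operands must hold eight elements")
    total = 0
    for x, y in zip(a, b):
        if not (0 <= x <= 255 and 0 <= y <= 255):
            raise ValueError("elements must be unsigned 8-bit values")
        diff = x - y
        total += diff * diff
    return total


def isqrt32(value: int) -> int:
    """Floor of the square root of an unsigned 32-bit integer.

    Shift-and-subtract method: tries one result bit per iteration, using
    only additions, subtractions, comparisons and shifts.
    """
    if not 0 <= value < 2 ** 32:
        raise ValueError("value must fit in 32 bits")
    result = 0
    bit = 1 << 30  # highest power of four representable in 32 bits
    while bit > value:
        bit >>= 2
    while bit:
        if value >= result + bit:
            value -= result + bit
            result = (result >> 1) + bit
        else:
            result >>= 1
        bit >>= 2
    return result


distance_sq = l2_squared_8x8([10] * 8, [13] * 8)  # eight differences of 3
distance = isqrt32(distance_sq)
```

Here distance_sq is 72 and distance is 8, the integer part of the
Euclidean distance between the two vectors.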

[0065] The overall performance improvements for the 
face recognition tasks are shown in the table of Figure 6. 
[0066] Execution time is compared for a 32-bit RISC
processor with basic DSP extensions (MAC, zero-overhead
loops, etc.) and the same processor enhanced with
application-specific instructions.

[0067] Measured speed-ups range from 1.8x to 10.6x
(on the most-demanding task), with an overall improvement
of 8.5x. It should be noted that switching between
algorithm stages requires only one reconfiguration of
the e-FPGA. Reconfiguration time is negligible.
[0068] The speed-up factors take into account the 
possible multi-cycle clock penalty due to processor-FP- 
GA synchronization in case of instruction extensions 
slower than the processor clock. Energy efficiency fig- 
ures are reported in Figure 6 too. 
[0069] Although the average power consumption of the
system extended with the e-FPGA is slightly higher
(10-15%), the energy reduction for executing each of the
tasks on its specific HW configuration (power-delay
product improvement) results in an overall reduction of
6.7x.

[0070] Only one task showed slightly worse total ex- 
ecution energy, though showing benefits on execution 
speed. 

[0071] The last column of Figure 6 reports the energy-delay
improvement of each specific HW configuration
compared to the general-purpose counterpart. The energy
required for e-FPGA reconfiguration is always negligible.

[0072] Measurements show the best energy efficiency
in the range of several MOPS/mW at 1.8V supply. It
lies between conventional ASIP/DSP and dedicated
configurable hardware implementations.
[0073] The full processing unit on a single chip is
implemented in a 0.18µm, 2PL-6ML CMOS embedded
Flash technology. The chip area is 70mm². Technology and
device characteristics are summarized in Figure 6, while
a chip micrograph is shown in Figure 7.



Claims 

1. A dynamically reconfigurable processing unit (1)
including an embedded Flash memory device (4) for
non-volatile storage of code, data and bit-streams,
the unit (1) being integrated into a single chip together
with a microprocessor (2) core, further comprising
an S-RAM based embedded FPGA unit
structured for FPGA reconfigurations having a specific
programming interface (7) connected to a port
(FP) of said Flash memory device (4) through a
DMA channel (8).



2. A dynamically reconfigurable processing unit according
to claim 1, wherein said DMA channel (8)
handles the bitstream transfer while said microprocessor
(2) fetches instructions and data from different
Flash memory ports of said Flash memory device
(4): a wide code port (CP) and a data port (DP).

3. A dynamically reconfigurable processing unit according
to claim 2, wherein said Flash memory device
(4) includes a modular array structure (13)
comprising N memory blocks (16), and wherein a
power block (10), including charge pumps, is
shared among different flash memory modules (16)
through a PMA arbiter (12) in a multi-bank fashion.

4. A dynamically reconfigurable processing unit according
to claim 1, wherein said embedded FPGA
unit (3) exploits the following functions:

iv) extension of the processor datapath supporting
a set of additional special-purpose C-callable
microprocessor instructions;

v) bus-mapped coprocessors, connected to the
system bus through a master/slave interface;

vi) flexible I/O to connect external units or sensors
with application-specific communication
protocols.

5. A dynamically reconfigurable processing unit according
to claim 2, wherein said Flash memory device
(4) includes at least three different access
ports, each for a specific function:

said code port (CP) optimized for random access
time and the application system;

said data port (DP) allowing an easy way to access
and modify application data; and,

said FPGA port (FP) offering a serial access for
a fast download of bit streams for embedded
FPGA (e-FPGA) configurations.

6. A dynamically reconfigurable processing unit according
to claim 2, wherein said third port (FP) comprises
four configuration registers replicating the
information stored in said code port (CP) that must be
used in order to write e-FPGA configuration data.

7. A dynamically reconfigurable processing unit according
to claim 5, wherein said third port (FP) uses
a chip select to access the addressable memory
space and a burst enable to allow burst serial access.

8. A dynamically reconfigurable processing unit according
to claim 1, wherein said connection between
said interface (7) and said port (FP) is provided
by a local bus (6).

9. A dynamically reconfigurable processing unit according
to claim 5, wherein said Flash memory device
(4) includes four modules (16) each arranged
in at least three programmable user-defined partitions,
each one devoted to a corresponding port.